87 research outputs found

    SChloro: directing Viridiplantae proteins to six chloroplastic sub-compartments

    Motivation: Chloroplasts are organelles found in plants and involved in several important cell processes. Like other compartments in the cell, chloroplasts have an internal structure comprising several sub-compartments, to which different proteins are targeted to perform their functions. Given the relation between protein function and localization, the availability of effective computational tools to predict protein sub-organelle localization is crucial for large-scale functional studies. Results: In this paper we present SChloro, a novel machine-learning approach to predict protein sub-chloroplastic localization, based on targeting signal detection and membrane protein information. The proposed approach performs multi-label predictions discriminating six chloroplastic sub-compartments: inner membrane, outer membrane, stroma, thylakoid lumen, plastoglobule and thylakoid membrane. In comparative benchmarks, the proposed method outperforms current state-of-the-art methods in both single- and multi-compartment predictions, with an overall multi-label accuracy of 74%. The results demonstrate the relevance of the approach, which is a good candidate for integration into more general large-scale annotation pipelines of protein subcellular localization.

    Machine-learning methods for structure prediction of β-barrel membrane proteins

    Different types of proteins exist with diverse functions that are essential for living organisms. An important class of proteins is represented by transmembrane proteins, which are inserted into biological membranes and perform very important functions in the cell, such as cell communication and active transport across the membrane. Transmembrane β-barrels (TMBBs) are a sub-class of membrane proteins largely under-represented in structure databases because of the extreme difficulty of experimental structure determination. For this reason, computational tools that are able to predict the structure of TMBBs are needed. In this thesis, two computational problems related to TMBBs were addressed: the detection of TMBBs in large datasets of proteins and the prediction of the topology of TMBB proteins. Firstly, a method for TMBB detection was presented based on a novel neural network framework for variable-length sequence classification. The proposed approach was validated on a non-redundant dataset of proteins. Furthermore, we carried out genome-wide detection using the entire Escherichia coli proteome. In both experiments, the method significantly outperformed other existing state-of-the-art approaches, reaching very high PPV (92%) and MCC (0.82). Secondly, a method was introduced for TMBB topology prediction. The proposed approach is based on grammatical modelling and probabilistic discriminative models for sequence data labeling. The method was evaluated using a newly generated dataset of 38 TMBB proteins obtained from high-resolution data in the PDB. The results show that the model correctly predicts the topologies of 25 out of 38 protein chains in the dataset. When tested on previously released datasets, the performance of the proposed approach was comparable or superior to the current state of the art in TMBB topology prediction.

    BUSCA: An integrative web server to predict subcellular localization of proteins

    Here, we present BUSCA (http://busca.biocomp.unibo.it), a novel web server that integrates different computational tools for predicting protein subcellular localization. BUSCA combines methods for identifying signal and transit peptides (DeepSig and TPpred3), GPI-anchors (PredGPI) and transmembrane domains (ENSEMBLE3.0 and BetAware) with tools for discriminating subcellular localization of both globular and membrane proteins (BaCelLo, MemLoci and SChloro). Outcomes from the different tools are processed and integrated for annotating subcellular localization of both eukaryotic and bacterial protein sequences. We benchmark BUSCA against protein targets derived from recent CAFA experiments and other specific data sets, reporting state-of-the-art performance. BUSCA scores better than all other evaluated methods on 2732 targets from CAFA2, with an F1 value equal to 0.49, and is among the best methods when predicting targets from CAFA3. We propose BUSCA as an integrated and accurate resource for the annotation of protein subcellular localization.

    Large scale analysis of protein stability in OMIM disease related human protein variants

    Modern genomic techniques make it possible to associate several Mendelian human diseases with single-residue variations in different proteins. The molecular mechanisms explaining the relationship between genotype and phenotype are still under debate. The change of protein stability upon variation appears to be particularly relevant in annotating whether a single-residue substitution can or cannot be associated with a given disease. Thermodynamic properties of human proteins and of their disease-related variants are lacking. In the present work, we take advantage of the available three-dimensional structures of human proteins to predict the role of disease-related variations in the perturbation of protein stability.

    ISPRED4: interaction sites PREDiction in protein structures with a refining grammar model

    The identification of protein-protein interaction (PPI) sites is an important step towards the characterization of protein functional integration in the complexity of the cell. Experimental methods are costly and time-consuming, and computational tools for predicting PPI sites can fill the gaps in present knowledge of PPIs. We present ISPRED4, an improved structure-based predictor of PPI sites on unbound monomer surfaces. ISPRED4 relies on machine-learning methods and incorporates features extracted from protein sequence and structure. Cross-validation experiments carried out on a new dataset of 151 high-resolution protein complexes indicate that ISPRED4 achieves a per-residue Matthews correlation coefficient of 0.48 and an overall accuracy of 0.85. Benchmarking results show that ISPRED4 is one of the top-performing PPI site predictors developed so far.

    Mapping human disease-associated enzymes into Reactome allows characterization of disease groups and their interactions

    According to databases such as OMIM, Humsavar, Clinvar and Monarch, 1494 human enzymes are presently associated with 2539 genetic diseases, 75% of which are rare (with an Orphanet code). The Mondo ontology initiative standardizes disease names into specific codes, making possible a computational association between genes, variants, diseases, and their effects on biological processes. Here, we tackle the problem of which biological processes enzymes can affect when the protein variant is disease-associated. We adopt Reactome to describe human biological processes, and by mapping disease-associated enzymes onto the Reactome pathways, we establish a Reactome-disease association. This allows a novel categorization of human monogenic and polygenic diseases based on Reactome pathways and reactions. Our analysis aims at dissecting the complexity of the human genetic disease universe, highlighting all the possible links between diseases and Reactome pathways. The novel mapping helps in understanding the biochemistry and molecular biology of the disease and gives a direct glimpse of the present knowledge of other molecules involved. This is useful for a complete overview of the disease molecular mechanism(s) and for planning future investigations. Data are collected in DAR, a freely searchable database available at https://dar.biocomp.unibo.it.

    DeepSig: Deep learning improves signal peptide detection in proteins

    Motivation: The identification of signal peptides in protein sequences is an important step toward protein localization and function characterization. Results: Here, we present DeepSig, an improved approach for signal peptide detection and cleavage-site prediction based on deep learning methods. Comparative benchmarks performed on an updated independent dataset of proteins show that DeepSig is the current best-performing method, scoring better than other available state-of-the-art approaches on both signal peptide detection and precise cleavage-site identification. Availability and implementation: DeepSig is available as both a standalone program and a web server at https://deepsig.biocomp.unibo.it. All datasets used in this study can be obtained from the same website.

    Cloning the barley nec3 disease lesion mimic mutant using complementation by sequencing

    Disease lesion mimic (DLM) or necrotic mutants display necrotic lesions in the absence of pathogen infections. They can show improved resistance to some pathogens, and their molecular dissection can contribute to revealing components of plant defense pathways. Although forward-genetics strategies to find genes causal to mutant phenotypes are available in crops, these strategies require the production of experimental cross populations, mutagenesis, or gene editing, and are time- and resource-consuming or may have to deal with regulated plant materials. In this study, we describe a collection of 34 DLM mutants in barley (Hordeum vulgare L.) and apply a novel method called complementation by sequencing (CBS), which enables the identification of the gene responsible for a mutant phenotype given the availability of two or more chemically mutagenized individuals showing the same phenotype. Complementation by sequencing relies on the feasibility of obtaining all induced mutations present in chemical mutants and on the low probability that different individuals share the same mutated genes. By CBS, we identified a cytochrome P450 CYP71P1 gene as responsible for orange blotch DLM mutants, including the historical barley nec3 locus. By comparative phylogenetic analysis, we showed that the CYP71P1 gene family emerged early in angiosperm evolution but has been recurrently lost in some lineages, including Arabidopsis thaliana (L.) Heynh. Complementation by sequencing is a straightforward, cost-effective approach to clone genes controlling phenotypes in a chemically mutagenized collection. The TILLMore (TM) collection will be instrumental for understanding the molecular basis of DLM phenotypes and for contributing knowledge about mechanisms of host-pathogen interaction.

    Resources and tools for rare disease variant interpretation

    Collectively, rare genetic disorders affect a substantial portion of the world's population. In most cases, those affected face difficulties in receiving a clinical diagnosis and genetic characterization. The understanding of the molecular mechanisms of these diseases and the development of therapeutic treatments for patients are also challenging. However, the application of recent advancements in genome sequencing/analysis technologies and computer-aided tools for predicting phenotype-genotype associations can bring significant benefits to this field. In this review, we highlight the most relevant online resources and computational tools for genome interpretation that can enhance the diagnosis, clinical management, and development of treatments for rare disorders. Our focus is on resources for interpreting single nucleotide variants. Additionally, we present use cases for interpreting genetic variants in clinical settings and review the limitations of these results and prediction tools. Finally, we have compiled a curated set of core resources and tools for analyzing rare disease genomes. Such resources and tools can be utilized to develop standardized protocols that will enhance the accuracy and effectiveness of rare disease diagnosis.

    Grammatical-Restrained Hidden Conditional Random Fields for Bioinformatics applications

    Background: Discriminative models are designed to naturally address classification tasks. However, some applications require the inclusion of grammar rules, and in these cases generative models, such as Hidden Markov Models (HMMs) and Stochastic Grammars, are routinely applied. Results: We introduce Grammatical-Restrained Hidden Conditional Random Fields (GRHCRFs) as an extension of Hidden Conditional Random Fields (HCRFs). GRHCRFs, while preserving the discriminative character of HCRFs, can assign labels in agreement with the production rules of a defined grammar. The main GRHCRF novelty is the possibility of including in HCRFs prior knowledge of the problem by means of a defined grammar. Our current implementation allows regular grammar rules. We test our GRHCRF on a typical biosequence labeling problem: the prediction of the topology of Prokaryotic outer-membrane proteins. Conclusion: We show that in a typical biosequence labeling problem the GRHCRF performs better than CRF models of the same complexity, indicating that GRHCRFs can be useful tools for biosequence analysis applications. Availability: GRHCRF software is available under GPLv3 licence at http://www.biocomp.unibo.it/~savojard/biocrf-0.9.tar.gz